Chapter 16. CGI and Perl
The Common Gateway Interface (CGI) is one of the oldest tools for connecting web sites to program logic, and it's still a common starting point. CGI provides a standard interface between the web server and applications, making it easier to write applications without having to build them directly into the server. Developers have been writing CGI scripts since the early days of the NCSA server, and Apache continues to support this popular and well-understood (if inefficient) mechanism for connecting HTTP requests to programs. While CGI scripts can be written in a variety of languages, the dominant language for CGI work has pretty much always been Perl. This chapter will explore CGI's capabilities, explain its integration with Apache, and provide a demonstration in Perl.

16.1 The World of CGI

Very few serious sites nowadays can do without scripts in one way or another. If you want to interact with your visitors even as simply as "Hello John Doe, thanks for visiting us again" (done by checking his cookie, as described later in this chapter, against a database of names), you need to write some code. If you want to do any kind of business with him, you can hardly avoid it. If you want to serve up the contents of a database (the stock of a shop or the articles of an encyclopedia), a script might be a useful way to do it.

Scripts are typically, though not always, interpreted, and they are generally an easier approach to gluing pieces together than the write and compile cycle of more formal programs. Writing scripts brings together a number of different packages and web skills whose documentation is sometimes hard to find. Until all of it works, none of it works; so we thought it might be useful to run through the basic elements here and to point readers at sources of further knowledge.

16.1.1 Writing and Executing Scripts

What is a script? If you're not a programmer, it can all be rather puzzling.
A script is a set of instructions to do something, which are executed by the computer. To demonstrate what happens, get your computer to show its command-line prompt, start up a word processor, and type:
    #! /bin/sh
    echo "have a nice day"

Save this as fred, and make it executable by doing:

    chmod +x fred
Run it with the following:

    ./fred

Under Windows, type the following instead:

    @echo off
    echo "have a nice day"

The odd first line turns off command-line echoing (to see what this means, omit it). Save this as the file fred.bat, and run it by typing fred. In both cases we get the cheering message have a nice day. If you have never written a program before, you have now.

It may seem one thing to write a program that you can execute on your own screen; it's quite another to write a program that will do something useful for your clients on the Web. However, we will leap the gap.

16.1.2 Scripts and Apache

A script that is going to be useful on the Web must be executed by Apache. There are two considerations here:
16.1.2.1 Executable script

Bear in mind that your CGI script must be executable in the opinion of your operating system. To test it, you can run it from the console with the same login that Apache uses. If it will not run, you have a problem that's signaled by disagreeable messages at the client end, plus equivalent stories in the log files on the server, such as:

    You don't have permission to access /cgi-bin/mycgi.cgi on this server

16.2 Telling Apache About the Script

Since we have two different techniques here, we have two Config files: .../conf/httpd1.conf and .../conf/httpd2.conf. The script go takes the argument 1 or 2. You need to do either of the following:

16.2.1 Script in cgi-bin

Use ScriptAlias in your host's Config file, pointing to a safe location outside your web space. This makes for better security because the Bad Guys cannot read your scripts and analyze them for holes. "Security by obscurity" is not a sound policy on its own, but it does no harm when added to more vigorous precautions.

To steer incoming demands for the script to the right place (.../cgi-bin), we need to edit our .../site.cgi/conf/httpd1.conf file so it looks something like this:

    User webuser
    Group webgroup
    ServerName www.butterthlies.com
    #for scripts in ../cgi-bin
    ScriptAlias /cgi-bin /usr/www/APACHE3/cgi-bin
    DirectoryIndex /cgi-bin/script_html

You would probably want to proceed in this way, that is, putting the script in the cgi-bin directory (which is not in /usr/www/APACHE3/site.cgi/htdocs), if you were offering a web site to the outside world and wanted to maximize your security. Run Apache to use this script with the following:

    ./go 1

You would access this script by browsing to http://www.butterthlies.com/cgi-bin/mycgi.cgi.

16.2.2 Script in DocumentRoot

The other method is to put scripts in among the HTML files. You should only do this if you trust the authors of the site to write safe scripts (or not write them at all) since security is much reduced.
Generally speaking, it is safer to use a separate directory for scripts, as explained previously. First, it means that people writing HTML can't accidentally or deliberately cause security breaches by including executable code in the web tree. Second, it makes life harder for the Bad Guys: often it is necessary to allow fairly wide access to the nonexecutable part of the tree, but more careful control can be exercised on the CGI directories. We would not suggest you do this unless you absolutely have to. But regardless of these good intentions, we put mycgi.cgi in .../site.cgi/htdocs. The Config file, .../site.cgi/conf/httpd2.conf, is now:

    User webuser
    Group webgroup
    ServerName www.butterthlies.com
    DocumentRoot /usr/www/APACHE3/site.cgi/htdocs
    AddHandler cgi-script cgi
    Options ExecCGI

Use AddHandler to set a handler type of cgi-script with the extension .cgi. This means that any document Apache comes across with the extension .cgi will be taken to be an executable script. You put the CGI scripts, called <name>.cgi, in your document root. You also need to have Options ExecCGI. To run this one, type the following:

    ./go 2

Because the script now lives in the document root, you would access it by browsing to http://www.butterthlies.com/mycgi.cgi.

To experiment, we have a simple test script, mycgi.cgi, in two locations: .../cgi-bin to test the first method and .../site.cgi/htdocs to test the second. When it works, we would write the script properly in C or Perl or whatever.
The script mycgi.cgi looks like this:

    #!/bin/sh
    echo "Content-Type: text/plain"
    echo
    echo "Have a nice day"
Under Win32, providing you want to run your script under COMMAND.COM and call it mycgi.bat, the script can be a little simpler than the Unix version because it doesn't need the line that specifies the shell. Note that COMMAND.COM's echo prints quotation marks literally, so the header line must not be quoted:

    @echo off
    echo Content-Type: text/plain
    echo.
    echo Have a nice day
The @echo off command turns off command-line echoing, which would otherwise completely destroy the output of the batch file. The slightly weird-looking echo. gives a blank line (a plain echo without a dot prints ECHO is off).
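For comparison, the same test could be sketched in Perl, the language used for most of this chapter. This is only an illustration: the interpreter path /usr/bin/perl and the helper name cgi_response are assumptions, not part of the test scripts above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a complete CGI response: one header line, the blank line
# that ends the headers, then the body.
# (cgi_response is a hypothetical helper name used for illustration.)
sub cgi_response {
    my ($body) = @_;
    return "Content-Type: text/plain\n\n" . $body . "\n";
}

print cgi_response("Have a nice day");
```

Keeping the header and the body in one helper makes it hard to forget the blank line between them.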
If you are running a more exotic shell or interpreter, like bash or perl, you need the "shebang" line at the top of the script to invoke it. These must be the very first characters in the file:

    #!shell path
    ...

16.2.3 Perl

You can download Perl for free from http://www.perl.org. Read the README and INSTALL files and do what they say. Once it is installed on a Unix system, you have an online manual. perldoc perldoc explains how the manual system works. perldoc -f print, for example, explains how the function print works; perldoc -q print finds "print" in the Perl FAQ.

A simple Perl script looks like this:

    #! /usr/local/bin/perl -wT
    use strict;
    print "Hello world\n";

The first line, the "shebang" line, loads the Perl interpreter (which might also be in /usr/bin/perl) with the -wT flag, which invokes warnings and checks incoming data for "taint." Tainted data could have come from Bad Guys and could contain a malicious program in disguise. -T makes sure you have always processed everything that comes from "outside" before you use it in any potentially dangerous functions. For a fuller explanation of a complicated subject, see Programming Perl by Larry Wall, Jon Orwant, and Tom Christiansen (O'Reilly, 2000). There isn't any input here, so -T is not necessary, but it's a good habit to get into.

The second line loads the strict pragma: it imposes a discipline on your code that is essential if you are to write scripts for the Web. The third line prints "Hello world" to the screen.

Having written this, saved it as hello.pl, and made it executable with chmod +x hello.pl, you can run it by typing ./hello.pl.

Whenever you write a new script or alter an old one, you should always run it from the command line first to detect syntax errors. This applies even if it will normally be run by Apache.
For instance, take the trailing " off the last line of hello.pl, and run it again:

    Can't find string terminator '"' anywhere before EOF at ./hello.pl line 4

16.2.4 Databases

Many serious web sites will need a database in back. In the authors' experience, an excellent choice is MySQL, freeware made in Scandinavia by intelligent and civilized people. Download it from http://www.mysql.com. It uses a variant of the more-or-less standard SQL query language. You will need a book on SQL: Understanding SQL by Martin Gruber (Sybex, 1990) tells you more than you need to know, although the SQL syntax described is sometimes a little different from MySQL's. Another option is SQL in a Nutshell by Kevin Kline (O'Reilly, 2000).

MySQL is fast, reliable, and so easy to use that a lot of the time you can forget it is there. You link to MySQL from your scripts through the DBI module. Download it from CPAN (http://www.cpan.org/) if it doesn't come with Perl. You will need some documentation on DBI; try http://www.symbolstone.org/technology/perl/DBI/doc/faq.html. There is also an O'Reilly book on DBI, Programming the Perl DBI by Alligator Descartes and Tim Bunce.

In practice, you don't need to know very much about DBI because you only need to access it in five different ways. See the lines marked 'A', 'B', 'C', 'D', and 'E' in script, as follows:

    'A' To open a database.
    'B' To execute a single command, which could equally well have been typed at the keyboard as a MySQL command line.
    'C' To retrieve, display, and process fields from a set of database records. A very nice thing about MySQL is that you can use the 'select *' command, which will make all the fields available via the $ref->{'<fieldname>'} mechanism.
    'D' To free up a search handle.
    'E' To disconnect from a database.

If you forget the last two, it can appear not to matter since the database disconnect will be automatic when the Perl script terminates.
However, if you then move to mod_perl (discussed in Chapter 17), it will matter a lot since you will then accumulate large numbers of memory-consuming handles. And, if you have very new versions of MySQL and DBI, you may find that the transaction is automatically rolled back if you exit without terminating the query handle.

This script assumes that there is a database called people. Before you can get MySQL to work, you have to set up this database and its permissions by running:

    mysql mysql < load_database

where load_database is the script .../cgi-bin/load_database:

    create database people;
    INSERT INTO db VALUES ('localhost','people','webserv','Y','Y','Y','Y','N','N','N','N','N','N');
    INSERT INTO user VALUES ('localhost','webserv','','Y','Y','Y','Y','N','N','N','N','N','N','N','N','N','N');
    INSERT INTO user VALUES ('<IP address> ','webserv','','Y','Y','Y','Y','N','N','N','N','N','N','N','N','N','N');

You then have to restart with mysqladmin reload to get the changes to take effect. Newer versions of MySQL may support the GRANT command, which makes things easier.

You can now run the next script, which will create and populate the table people:

    mysql people < load_people

The script is .../cgi-bin/load_people:

    # MySQL dump 5.13
    #
    # Host: localhost    Database: people
    #--------------------------------------------------------
    # Server version 3.22.22
    #
    # Table structure for table 'people'
    #
    CREATE TABLE people (
      xname varchar(20),
      sname varchar(20)
    );
    #
    # Dumping data for table 'people'
    #
    INSERT INTO people VALUES ('Jane','Smith');
    INSERT INTO people VALUES ('Anne','Smith');
    INSERT INTO people VALUES ('Anne-Lise','Horobin');
    INSERT INTO people VALUES ('Sally','Jones');
    INSERT INTO people VALUES ('Anne-Marie','Kowalski');

It will be found in .../cgi-bin.
Another nice thing about MySQL is that you can reverse the process by:

    mysqldump people > load_people

This turns a database into a text file that you can read, archive, and upload onto other sites, and this is how the previous script was created. Moreover, you can edit self-contained lumps out of it, so that if you wanted to copy a table alone or the table and its contents to another database, you would just lift the commands from the dump file.

We now come to the Perl script that exercises this database. To begin with, we ignore Apache. It is .../cgi-bin/script:

    #! /usr/local/bin/perl -wT
    use strict;
    use DBI( );
    my ($mesg,$dbm,$query,$xname,$sname,$sth,$rows,$ref);
    $xname="Anne Jane";
    $sname="Beauregard";
    # Note A above: open a database
    $dbm=DBI->connect("DBI:mysql:database=people;host=localhost",'webserv')
        or die "didn't connect to people";
    #insert some more data just to show we can
    $query=qq(insert into people (xname,sname) values ('$xname','$sname'));
    #Note B above: execute a command
    $dbm->do($query);
    # get it back
    $xname="Anne";
    $query=qq(select xname, sname from people where xname like "%$xname%");
    #Note C above:
    $sth=$dbm->prepare($query) or die "failed to prepare $query: $!";
    # $! is the Perl variable for the current system error message
    $sth->execute;
    $rows=$sth->rows;
    print qq(There are $rows people with names matching '$xname'\n);
    while ($ref=$sth->fetchrow_hashref)
        {
        print qq($ref->{'xname'} $ref->{'sname'}\n);
        }
    #D: free the search handle
    $sth->finish;
    #E: close the database connection
    $dbm->disconnect;

Stylists may complain that the $dbm->prepare($query) lines, together with some of the quoting issues, can be neatly sidestepped by code like this:

    $surname="O'Reilly";
    $forename="Tim";
    ...
    $dbm->do('insert into people(xname,sname) values (?,?)',{},$forename,$surname);

The effect is that DBI fills in the ?s with the values of the $forename, $surname variables.
However, building a $query variable has the advantage that you can print it to the screen to make sure all the bits are in the right place and you can copy it by hand to the MySQL interface to make sure it works before you unleash the line:

    $sth=$dbm->prepare($query)

The reason for doing this is that a badly formed database query can make DBI or MySQL hang. You'll spend a long time staring at a blank screen and be no wiser.

For the moment, we ignore Apache. When you run script by typing ./script, it prints:

    There are 4 people with names matching 'Anne'
    Anne Smith
    Anne-Lise Horobin
    Anne Jane Beauregard
    Anne-Marie Kowalski

Each time you run this, you add another Beauregard, so the count goes up.

MySQL provides a direct interface from the keyboard, by typing (in this case) mysql people. This lets you try out the queries you will write in your scripts. You should try out the two queries in the previous script before running it.

16.2.5 HTML

The script we just wrote prints to the screen. In real life we want it to print to the visitor's screen via her browser. Apache gets it to her, but to get the proper effect, we need to send our data wrapped in HTML codes. HTML is not difficult, but you will need a thorough book on it,[1] because there are a large number of things you can do, and if you make even the smallest mistake, the results can be surprising as browsers often ignore badly formed HTML. All browsers will put up with some harmless common mistakes, like forgetting to put a closing </body></html> at the end of a page. Strictly speaking, attributes inside HTML tags should be in quotes, thus:

    <A target="MAIN"...>
    <Font color="red"...>

However, the browsers do not all behave in the same way. MSIE, for instance, will tolerate the absence of a closing </form> or </table> tag, but Netscape will not. The result is that pages will, strangely, work for some visitors and not for others.
Another trap appears when you use Apache's ability to pass extra data in a link (available when CGI has been enabled by ScriptAlias):

    <A HREF="/my_script/data1/data2">

This results in my_script being run and /data1/data2 appearing in the environment variable PATH_INFO. One browser will tolerate spaces in the data, and the other one will not. The moral is that you should thoroughly test your site, using at least the two main browsers (MSIE and Netscape) and possibly some others. You can also use an HTML syntax checker like WebLint, which has many gateways, e.g., http://www.ews.uiuc.edu/cgi-bin/weblint, or Dr. HTML at http://www2.imagiware.com/RxHTML/.

16.2.6 Running a Script via Apache

This time we will arrange for Apache to run the script. Let us adapt the previous script to print a formatted list of people matching the name "Anne." This version is called .../cgi-bin/script_html.

    #! /usr/local/bin/perl -wT
    use strict;
    use DBI( );
    my ($ref,$mesg,$dbm,$query,$xname,$sname,$sth,$rows);
    #print HTTP header
    print "content-type: text/html\n\n";
    # open a database
    $dbm=DBI->connect("DBI:mysql:database=people;host=localhost",'webserv')
        or die "didn't connect to people";
    # get it back
    $xname="Anne";
    $query=qq(select xname, sname from people where xname like "%$xname%");
    $sth=$dbm->prepare($query) or die "failed to prepare $query: $!";
    # $! is the Perl variable for the current system error message
    $sth->execute;
    $rows=$sth->rows;
    #print HTML header
    print qq(<HTML><HEAD><TITLE>People's names</TITLE></HEAD><BODY>
    <table border=1 width=70%><caption><h3>The $rows People called '$xname'</h3></caption>
    <tr align=left><th>First name</th><th>Last name</th></tr>);
    while ($ref=$sth->fetchrow_hashref)
        {
        print qq(<tr align=right><td>$ref->{'xname'}</td><td>$ref->{'sname'}</td></tr>);
        }
    print "</table></BODY></HTML>";
    $sth->finish;
    # close the database connection
    $dbm->disconnect;

16.2.7 Quote Marks

The variable that contains the database query is the $query string.
Within that we have the problem of quotes. Perl likes double quotes if it is to interpolate a $ or @ value; MySQL likes quotes of some sort around a text variable. If we wanted to search for the person whose first name is in the Perl variable $xname, we could use the query string:

    $query="select * from people where xname='$xname'";

This will work and has the advantage that you can test it by typing exactly the same string on the MySQL command line. It has the disadvantage that while you can, mostly, orchestrate pairs of ' ' and " ", it is possible to run out of combinations. It has the worse disadvantage that if we allow clients to type a name into their browser that gets loaded into $xname, the Bad Guys are free to enter a name larded with quotes of their own, which could do undesirable things to your system by allowing them to add extra SQL to your supposedly innocuous query.

Perl allows you to open up the possibilities by using the qq( ) construct, which has the effect of double external quotes:

    $query=qq(select * from people where xname="$xname");

We can then go on to the following:

    $sth=$dbm->prepare($query) || die $dbm->errstr;
    $sth->execute;

But this doesn't solve the problem of attackers planting malicious SQL in $xname.

A better method still is to use DBI's placeholder mechanism. (See perldoc DBI.) We construct the query string with a hole marked by ? for the name variable, then supply it when the query is executed. This has the advantage that no quotes are needed in the query string at all, and the contents of $xname completely bypass the SQL parsing, which means that extra SQL cannot be added via that route at all. (However, note that it is good practice always to vet all user input before doing anything with it.) Furthermore, database access runs much faster since preparing the query only has to happen once (and query optimization is often also performed at this point, which can be an expensive operation).
This is particularly important if you have a busy web site doing lookups on different things:

    $query=qq(select * from people where xname=?);
    $sth=$dbm->prepare($query) || die $dbm->errstr;

When you want the database lookup to happen, you pass only the value for the placeholder (not the query string again):

    $sth->execute($xname);

This has an excellent impact on speed if you are doing the database accesses in a loop.

In the script, first we print the HTTP header (more about this will follow). Then we print the HTML header, together with the caption of the table. Each line of the table is printed separately as we search the database, using the DBI function fetchrow_hashref to load the variable $ref. Finally, we close the table (easily forgotten, but things can go horribly wrong if you don't) and close the HTML.

The following listing uses both qq( ) and a DBI placeholder:

    #! /usr/local/bin/perl -wT
    use strict;
    use DBI( );
    my ($ref,$mesg,$dbm,$query,$xname,$sname,$sth,$rows);
    $xname="Anne Jane";
    $sname="Beauregard";
    # open a database
    $dbm=DBI->connect("DBI:mysql:database=people;host=localhost",'webserv')
        or die "didn't connect to DB people";
    #insert some more data just to show we can
    # demonstrate qq( )
    $query=qq(insert into people (xname,sname) values ('$xname','$sname'));
    $dbm->do($query);
    # get it back
    $xname="Anne";
    #demonstrate DBI placeholder
    $query=qq(select xname, sname from people where xname like ?);
    $sth=$dbm->prepare($query) or die "failed to prepare $query: $!";
    # $! is the Perl variable for the current system error message
    #Now fill in the placeholder; the % wildcards travel in the bound value
    $sth->execute("%$xname%");
    $rows=$sth->rows;
    print qq(There are $rows people with names matching '$xname'\n);
    while ($ref=$sth->fetchrow_hashref)
        {
        print qq($ref->{'xname'} $ref->{'sname'}\n);
        }
    $sth->finish;
    # close the database connection
    $dbm->disconnect;

This script produces a reasonable looking page. Once you get it working, development is much easier. You can edit it, save it, refresh from the browser, and see the new version straight away.
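To make the quoting danger discussed in this section concrete, here is a small sketch that needs no database. It only builds strings: one query assembled by interpolation and one in placeholder form. The helper names interpolated_query and placeholder_query, and the hostile input, are invented for illustration.

```perl
use strict;
use warnings;

# Build the query by interpolating the value into the SQL text.
# (Hypothetical helper, for illustration only.)
sub interpolated_query {
    my ($xname) = @_;
    return qq(select * from people where xname='$xname');
}

# Placeholder form: the SQL text is fixed and the value travels
# separately, as it would when handed to DBI's execute.
sub placeholder_query {
    my ($xname) = @_;
    return ('select * from people where xname=?', $xname);
}

# A name "larded with quotes" of the attacker's own.
my $hostile = "x' or 'a'='a";

# The attacker's quotes have become part of the SQL itself.
print interpolated_query($hostile), "\n";

# The SQL text is unchanged whatever the value contains.
my ($sql, @bind) = placeholder_query($hostile);
print "$sql  [bound value: @bind]\n";
```

In the first case the where clause has grown an extra condition supplied by the visitor; in the second the statement text never changes, which is the property the placeholder mechanism relies on.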
Use ./go 1 and browse to http://www.butterthlies.com to see a table of girls called "Anne." This works because in the Config file we declared this script as the DirectoryIndex. In this way we don't need to provide any fixed HTML at all.

16.2.8 HTTP Header

One of the most crucial elements of a script is also hard to see: the HTTP header that goes ahead of everything else and tells the browser what is coming. If it isn't right, nothing happens at the far end.

A CGI script produces headers and a body. Everything up to the first blank line (strictly speaking, CRLF CRLF, but Apache will tolerate LF LF and convert it to the correct form before sending to the browser) is header, and everything else is body. The lines of the header are separated by LF or CRLF.

The CGI module (if you are using it) and Apache will send all the necessary headers except the one you need to control. This is normally:

    print "Content-Type: text/html\n\n";

If you don't want to send HTML but ordinary text as if to your own screen, use the following:

    print "Content-Type: text/plain\n\n";

Notice the second \n (C and Perl for newline). It terminates the headers (there can be more than one, each on its own line) and is always essential to make the HTTP header work. If you find yourself looking at a blank browser screen, suspect the HTTP header.

If you want to force your visitor's browser to go to another URL, include the following line:

    print "Location: http://URL\n\n";

CGIs can emit almost any legal HTTP header. (Note that although "Location" is an HTTP header, using it causes Apache to return a redirect response code as well as the location specified; this is a special case for redirects.) A complete list of HTTP headers can be found in section 14 of RFC2616 (the HTTP 1.1 specification), http://www.ietf.org/rfc/rfc2616.txt.

16.2.9 Getting Data from the Client

On many sites in real life, we need to ask the visitor what he wants, get the information back to the server, and then do something with it.
This, after all, is the main mechanism of e-commerce. HTML provides one standard method for getting data from the client: the Form. If we use the HTML Method='POST' in the form specification, the data the user types into the fields of the form is available to our script by reading stdin.

In POST-based Perl CGI scripts, this data can be read into a variable by setting it equal to <>:

    my ($data);
    $data=<>;

We can then rummage about in $data to extract the values typed in by the user.

In real life, you would probably use the CGI module, downloaded from CPAN (http://cpan.org), to handle the interface between your script and data from the form. It is easier and much more secure than doing it yourself, but we ignore it here because we want to illustrate the basic principles of what is happening.

We will add some code to the script to ask questions. One question will ask the reader to click if they want to see a printout of everyone in the database. The other will let them enter a name to replace "Anne" as the search criterion listed earlier.

It makes sense to use the same script to create the page that asks for input and then to handle that input once it arrives. The trick is to test the input channels for data at the top of the script. If there is none, it asks questions; if there is some, it gives answers.

16.2.9.1 Data from a link

If your Apache Config file invokes CGI processing with the directive ScriptAlias, you can construct links in your HTML that have extra data passed with them as if they were directory names. The extra data arrives in the environment variable PATH_INFO. For instance:

    ...
    <A HREF="/cgi-bin/script2_html/whole_database">Click here to see whole database</A>
    ...

When the user clicks on this link, she invokes script2_html and makes available to it the environment variable PATH_INFO, containing the string /whole_database.
We can test this in our Perl script with this:

    if($ENV{'PATH_INFO'} eq '/whole_database')
        {
        #do something
        }

Our script can then make a decision about what to do next on the basis of this information. The same mechanism is available with the HTML FORM ACTION attribute. We might set up a form in our HTML with the command:

    <FORM METHOD='POST' ACTION="/cgi-bin/script2_html/receipts">

As previously, /receipts will turn up in PATH_INFO, and your script knows which form sent the data and can go to the appropriate subroutine to deal with it.

What happens inside Apache is that the URI /cgi-bin/script2_html/receipts is parsed from right to left, looking for a filename, which does not have to be a CGI script. The material to the right of the filename is passed in PATH_INFO.

16.2.9.2 CGI.pm

The Perl module called CGI.pm does everything we discuss and more. Many professionals use it, and we are often asked why we don't show it here. The answer is that to get started, you need to know what is going on under the hood, and that is what we cover here. In fact, I tried to start with CGI.pm and found it completely baffling. It wasn't until I abandoned it and got my hands in the cogs that I understood how the interaction between the client's form and the server's script worked. When you understand that, you might well choose to close the hood in CGI.pm. But until then, it won't hurt to get to grips with the underlying process.

16.2.9.3 Questions and answers

Since the same script puts up a form that asks questions and also retrieves the answers to those questions, we need to be able to tell in which phase of the operation we are. We do that by testing $data to find out whether it is full or empty. If it is full, we find that all the data typed into the fields of the form by the user are there, with the fields separated by &.
For instance, if the user had typed "Anne" into the first-name box and "Smith" into the surname box, this string would arrive:

    xname=Anne&sname=Smith

or, if the browser is being very correct:

    xname=Anne;sname=Smith

We have to dissect it to answer the customer's question, but this can be a bit puzzling. Not only is everything crumpled together, but various characters are also encoded. For instance, if the user had typed "&" as part of his response, e.g., "Smith&Jones", it would appear as "Smith%26Jones". You will have noticed that "26" is the ASCII code in hexadecimal for "&". This is called URL encoding and is documented in the HTTP RFC. "Space" comes across as "+" or possibly "%20". For the moment we ignore this problem. Later on, when you are writing real applications, you would probably use the "unescape" function from CGI.pm to translate these characters.

The strategy for dealing with this stuff is to:
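The dissection and decoding described above can be sketched in a few lines of plain Perl. This is a minimal illustration under stated assumptions, not the method used by CGI.pm: the helper names url_decode and parse_form are hypothetical, and the sketch handles only "+" and %XX escapes.

```perl
use strict;
use warnings;

# Translate URL-encoded text: "+" stands for a space and
# %XX is a character given by two hexadecimal digits.
# (url_decode is a hypothetical helper name.)
sub url_decode {
    my ($text) = @_;
    $text =~ tr/+/ /;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

# Split the form data into fields on & or ;, split each field
# into name and value on the first =, and decode both parts.
# Decoding happens after splitting so that an encoded & or =
# inside a value cannot be mistaken for a separator.
sub parse_form {
    my ($data) = @_;
    my %field;
    foreach my $pair (split /[&;]/, $data) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $value = '' unless defined $value;
        $field{ url_decode($name) } = url_decode($value);
    }
    return %field;
}

my %form = parse_form('xname=Anne+Marie&sname=Smith%26Jones');
print "$_ = $form{$_}\n" for sort keys %form;
```

Returning a hash keyed by field name keeps the calling code independent of the order in which the browser sends the fields.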
See the first few lines of the following subroutine get_name( ). This is the script .../cgi-bin/script2_html, which asks questions and gets the answers. There are commented out debugging lines scattered through the script, such as:

    #print "in get_name: ARGS: @args, DATA: $data<BR>";

Put these in to see what is happening, then turn them off when things work. You may like to leave them in to help with debugging problems later on.

Another point of style: many published Perl programs use $dbh for the database handle; we use $dbm:

    #! /usr/local/bin/perl -wT
    use strict;
    use DBI( );
    use CGI;
    use CGI::Carp qw(fatalsToBrowser);
    my ($data,@args);
    $data=<>;
    if($data)
        {
        &get_name($data);
        }
    elsif($ENV{'PATH_INFO'} eq "/whole_database")
        {
        $data="xname=%&sname=%";
        &get_name( );
        }
    else
        {
        &ask_question;
        }
    # close the page once, whichever branch was taken
    print "</BODY></HTML>";

    sub ask_question
    {
    &print_header("ask_question");
    print qq(<A HREF="/cgi-bin/script2_html/whole_database">
    Click here to see the whole database</A>
    <BR><FORM METHOD='POST' ACTION='/cgi-bin/script2_html/name'>
    Enter a first name <INPUT TYPE='TEXT' NAME='xname' SIZE=20><BR>
    and or a second name <INPUT TYPE='TEXT' NAME='sname' SIZE=20><BR>
    <INPUT TYPE=SUBMIT VALUE='ENTER'></FORM>);
    }

    sub print_header
    {
    print qq(content-type: text/html\n\n
    <HTML><HEAD><TITLE>$_[0]</TITLE></HEAD><BODY>);
    }

    sub get_name
    {
    my ($t,@val,$ref,$mesg,$dbm,$query,$xname,$sname,$sth,$rows);
    &print_header("get_name");
    #print "in get_name: ARGS: @args, DATA: $data<BR>";
    $xname="%";
    $sname="%";
    # split the data into fields, then each field into name and value
    @args=split(/&/,$data);
    foreach $t (@args)
        {
        @val=split(/=/,$t);
        if($val[0] eq "xname")
            {
            $xname=$val[1] if($val[1]);
            }
        elsif($val[0] eq "sname")
            {
            $sname=$val[1] if($val[1]);
            }
        }
    # open a database
    $dbm=DBI->connect("DBI:mysql:database=people;host=localhost",'webserv')
        or die "didn't connect to people";
    # get it back
    $query=qq(select xname, sname from people where xname like ? and sname like ?);
    $sth=$dbm->prepare($query) or die "failed to prepare $query: $!";
    #print "$xname, $sname: $query<BR>";
    # $! is the Perl variable for the current system error message
    # a method call does not interpolate inside a string, so concatenate it
    $sth->execute($xname,$sname) or die "failed to execute: " . $dbm->errstr . "<BR>";
    $rows=$sth->rows;
    #print "$rows: $rows $query<BR>";
    if($sname eq "%" && $xname eq "%")
        {
        print qq(<table border=1 width=70%><caption><h3>The Whole Database (3)</h3></caption>);
        }
    else
        {
        print qq(<table border=1 width=70%><caption><h3>The $rows People called $xname $sname</h3></caption>);
        }
    print qq(<tr align=left><th>First name</th><th>Last name</th></tr>);
    while ($ref=$sth->fetchrow_hashref)
        {
        print qq(<tr align=right><td>$ref->{'xname'}</td><td>$ref->{'sname'}</td></tr>);
        }
    print "</table>";
    $sth->finish;
    # close the database connection
    $dbm->disconnect;
    }

The Config file is ...site.cgi/httpd3.conf.

    User webuser
    Group webgroup
    ServerName www.butterthlies.com
    DocumentRoot /usr/www/APACHE3/APACHE3/site.cgi/htdocs
    # for scripts in .../cgi-bin
    ScriptAlias /cgi-bin /usr/www/APACHE3/APACHE3/cgi-bin
    DirectoryIndex /cgi-bin/script2_html

Kill Apache and start it again with ./go 3.

The previous script handles getting data to and from the user and to and from the database. It encapsulates the essentials of an active web site whatever language it is written in. The main missing element is email (see the following section).

16.2.10 Environment Variables

Every request from a browser brings a raft of information with it to Apache, which reappears as environment variables.
It can be very useful to have a subroutine like this:

    sub print_env {
        foreach my $e (keys %ENV) {
            print "$e=$ENV{$e}\n";
        }
    }

If you call it at the top of a web page, you see something like this on your browser screen:

    SERVER_SOFTWARE = Apache/1.3.9 (Unix) mod_perl/1.22
    GATEWAY_INTERFACE = CGI/1.1
    DOCUMENT_ROOT = /usr/www/APACHE3/MedicPlanet/site.medic/htdocs
    REMOTE_ADDR = 192.168.123.1
    SERVER_PROTOCOL = HTTP/1.1
    SERVER_SIGNATURE =
    REQUEST_METHOD = GET
    QUERY_STRING =
    HTTP_USER_AGENT = Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
    PATH = /sbin:/bin:/usr/sbin:/usr/bin:/usr/games:/usr/local/sbin:/usr/local/bin:/usr/X11R6/bin:/root/bin
    HTTP_ACCEPT = image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/msword, application/vnd.ms-powerpoint, */*
    HTTP_CONNECTION = Keep-Alive
    REMOTE_PORT = 1104
    SERVER_ADDR = 192.168.123.5
    HTTP_ACCEPT_LANGUAGE = en-gb
    SCRIPT_NAME =
    HTTP_ACCEPT_ENCODING = gzip, deflate
    SCRIPT_FILENAME = /usr/www/APACHE3/MedicPlanet/cgi-bin/MP_home
    SERVER_NAME = www.Medic-Planet-here.com
    PATH_INFO = /
    REQUEST_URI = /
    HTTP_COOKIE = Apache=192.168.123.1.1811957344309436; Medic-Planet=8335562231
    SERVER_PORT = 80
    HTTP_HOST = www.medic-planet-here.com
    PATH_TRANSLATED = /usr/www/APACHE3/MedicPlanet/cgi-bin/MP_home/
    SERVER_ADMIN = [no address given]

All of these environment variables are available to your scripts via the %ENV hash. For instance, the value of $ENV{'GATEWAY_INTERFACE'} is 'CGI/1.1', as you can see in the listing. Environment variables can also be used to control some aspects of the behavior of Apache. Note that because these are just variables, nothing checks that you have spelled them correctly, so be very careful when using them.

16.3 Setting Environment Variables
When a script is called, it receives a lot of environment variables, as we have seen. It may be that you want to invent and pass some of your own. There are two directives to do this: SetEnv and PassEnv.
SetEnv sets an environment variable that is then passed to CGI scripts. We can create our own environment variables and give them values. For instance, we might have several virtual hosts on the same machine that use the same script. To distinguish which virtual host called the script (in a more abstract way than using the HTTP_HOST environment variable), we could make up our own environment variable VHOST:

    <VirtualHost host1>
    SetEnv VHOST customers
    ...
    </VirtualHost>
    <VirtualHost host2>
    SetEnv VHOST salesmen
    ...
    </VirtualHost>
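On the script side, the value set by SetEnv simply turns up in %ENV. The following is a minimal sketch of how a shared script might branch on it. The page titles and the function name title_for_vhost are our own inventions for illustration; only the variable VHOST and its two values come from the configuration above.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Hypothetical mapping from the VHOST values set above to page titles.
my %title_for = (
    customers => 'Customer pages',
    salesmen  => 'Salesmen pages',
);

# Return the title for the calling virtual host, with a fallback for
# requests where VHOST is unset or has an unexpected value.
sub title_for_vhost {
    my ($env) = @_;
    my $vhost = $env->{VHOST};
    return 'Unknown host'
        unless defined $vhost && exists $title_for{$vhost};
    return $title_for{$vhost};
}

print "Content-Type: text/plain\n\n";
print title_for_vhost(\%ENV), "\n";
```

Passing the environment in as a hash reference, rather than reading %ENV inside the function, makes the lookup easy to try from the command line with made-up values.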
UnsetEnv takes a list of environment variables and removes them from the set passed to CGI scripts.
PassEnv passes an environment variable to CGI scripts from the environment that was in force when Apache was started.[2] The script might need to know the operating system, so you could use the following:

    PassEnv OSTYPE

This variation assumes that your operating system sets OSTYPE, which is by no means a foregone conclusion.

16.4 Cookies
In the modern world of fawningly friendly e-retailing, cookies play an essential role in allowing web sites to recognize previous users and to greet them like long-lost, rich, childless uncles. Cookies offer the webmaster a way of remembering her visitors. The cookie is a bit of text, often containing a unique ID number, that is contained in the HTTP header. You can get Apache to concoct and send it automatically, but it is not very hard to do it yourself, and then you have more control over what is happening. You can also get Perl modules to help: CGI.pm and CGI::Cookie. But, as before, we think it is better to start as close as you can to the raw material.

The client's browser keeps a list of cookies and web sites. When the user goes back to a web site, the browser will automatically return the cookie, provided it hasn't expired. If a cookie does not arrive in the header, you, as webmaster, might like to assume that this is a first visit. If there is a cookie, you can tie up the site name and ID number in the cookie with any data you stored the last time someone visited you from that browser. For instance, when we visit Amazon, a cozy message appears: "Welcome back Peter or Ben Laurie." The Amazon system recognizes the cookie that came with our HTTP request, which our browser looked up from the one Amazon sent us the last time we visited.

A cookie is a text string. Its minimum content is Name=Value, and these can be anything you like, except semicolon, comma, or whitespace. If you absolutely must have these characters, use URL encoding (described earlier as "&" = "%26", etc.).
A useful sort of cookie would be something like this:

    Butterthlies=8335562231

Butterthlies identifies the web site that issued it, which is necessary on a server that hosts many sites. 8335562231 is the ID number assigned to this visitor on his last visit. To prevent hackers upsetting your dignity by inventing cookies that turn out to belong to other customers, you need to generate a rather large random number from an unguessable seed,[3] or protect them cryptographically. These are other possible fields in a cookie:
The fields are separated by semicolons, thus:

    Butterthlies=8335562231; expires=Mon, 27-Apr-2020 13:46:11 GMT

An incoming cookie appears in the Perl variable $ENV{'HTTP_COOKIE'}. If you are using CGI.pm, you can get it dissected automatically; otherwise, you need to take it apart using the usual Perl tools, identify the user, and do whatever you want to do to it. To send a cookie, you write it into the HTTP header, with the prefix Set-Cookie:

    Set-Cookie: Butterthlies=8335562231;expires=Mon, 27-Apr-2020 13:46:11 GMT

And don't forget the \n that terminates the line, nor the blank line that completes the HTTP headers as a whole.

It has to be said that some people object to cookies, but do they mind if the bartender recognizes them and pours a Bud when they go for a beer? Some sites find it worthwhile to announce in their Privacy Statement that they don't use them.

16.4.1 Apache Cookies
But you can, if you wish, get Apache to handle the whole thing for you with the directives that follow. In our opinion, Apache cookies are really only useful for tracking visitors through the site for after-the-fact log file analysis. To recapitulate: if a site is serving cookies and it gets a request from a user whose browser doesn't send one, the site will create one and issue it. The browser will then store the cookie for as long as CookieExpires allows (see later) and send it every time the user goes to your URL. However, all Apache does is store the user's cookie in the appropriate log. You have to discover that it's there and do something about it. This will necessarily involve a script (and quite an awkward one too, since it has to trawl the log files), so you might just as well do the whole cookie thing in your script and leave these directives alone: it will probably be easier.
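For readers who take the do-it-yourself route, here is a minimal sketch of taking an incoming cookie apart and issuing a new one using only core Perl. The function names parse_cookies and set_cookie_header are ours, not part of any module, and the ID is simply the example value used earlier; a real site would generate an unguessable one.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Split a Cookie header value such as "a=1; b=2" into a hash reference.
sub parse_cookies {
    my ($header) = @_;
    my %cookie;
    return \%cookie unless defined $header;
    foreach my $pair (split /;\s*/, $header) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name && defined $value;
        $cookie{$name} = $value;
    }
    return \%cookie;
}

# Build one Set-Cookie header line; the expiry string is optional.
sub set_cookie_header {
    my ($name, $value, $expires) = @_;
    my $line = "Set-Cookie: $name=$value";
    $line .= "; expires=$expires" if defined $expires;
    return "$line\n";
}

my $cookies = parse_cookies($ENV{'HTTP_COOKIE'});
my $id      = $cookies->{'Butterthlies'};

if (defined $id) {
    print "Content-Type: text/plain\n\n";
    print "Welcome back, visitor $id\n";
}
else {
    # First visit: issue the cookie before the blank line that ends the headers.
    print set_cookie_header('Butterthlies', '8335562231',
                            'Mon, 27-Apr-2020 13:46:11 GMT');
    print "Content-Type: text/plain\n\n";
    print "Hello, new visitor\n";
}
```

The sketch does no URL decoding of values and does not look the ID up anywhere; both would be needed before it could greet a visitor by name.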
CookieName allows you to set the name of the cookie served out. The default name is Apache. The new name can contain the characters A-Z, a-z, 0-9, _, and -.
CookieLog sets a filename relative to the server root for a file in which to log the cookies. It is more usual to configure a field with LogFormat and catch the cookies in the central log (see Chapter 10).
CookieExpires sets an expiration time on the cookie. Without it, the cookie has no expiration date (not even a very faraway one), and this means that it evaporates at the end of the session. The expiry-period can be given as a number of seconds or in a format such as "2 weeks 3 days 7 hours". If the second format is used, the string must be enclosed in double quotes. Valid time periods are as follows:
16.4.2 The Config File
The Config file is as follows:

    User webuser
    Group webgroup
    ServerName my586
    DocumentRoot /usr/www/APACHE3/site.first/htdocs
    TransferLog logs/access_log
    CookieName "my_apache_cookie"
    CookieLog logs/CookieLog
    CookieTracking on
    CookieExpires 10000

In the log file we find:

    192.168.123.1.5653981376312508 "GET / HTTP/1.1" [05/Feb/2001:12:31:52 +0000]
    192.168.123.1.5653981376312508 "GET /catalog_summer.html HTTP/1.1" [05/Feb/2001:12:31:55 +0000]
    192.168.123.1.5653981376312508 "GET /bench.jpg HTTP/1.1" [05/Feb/2001:12:31:55 +0000]
    192.168.123.1.5653981376312508 "GET /tree.jpg HTTP/1.1" [05/Feb/2001:12:31:55 +0000]
    192.168.123.1.5653981376312508 "GET /hen.jpg HTTP/1.1" [05/Feb/2001:12:31:55 +0000]
    192.168.123.1.5653981376312508 "GET /bath.jpg HTTP/1.1" [05/Feb/2001:12:31:55 +0000]

16.4.3 Email
From time to time a CGI script needs to send someone an email. If it's via a link selected by the user, use the HTML construct:

    <A HREF="mailto:administrator@butterthlies.com">Click here to email the administrator</A>

The user's normal email system will start up, with the address inserted. If you want an email to be sent automatically, without the client's collaboration or even her knowledge, then use the Unix sendmail program (see man sendmail). To call it from Perl (A is an arbitrary filehandle name):

    open A, "| sendmail -t" or die "couldn't open sendmail pipe $!";

A Win32 equivalent to sendmail seems to be at http://pages.infinit.net/che/blat/blat_f.html. However, the pages are in French. To download, click on "ici" in the line: Une version récente est ici. Alternatively, and possibly safer to use, there is the CPAN Mail::Mailer module.

The format of an email is pretty well what you see when you compose one via Netscape or MSIE: addressee, copies, subject, and message appear on separate lines; they are written separated by \n.
You would put the message into a Perl variable like this (the @ signs are escaped because qq( ) would otherwise try to interpolate them as arrays):

    $msg=qq(To:fred\@hissite.com\nCC:bill\@elsewhere.com\nSubject:party tonight\n\nBe at Jane's by 8.00\n);

Notice the double \n at the end of the email header. When the message is all set up, send it with:

    print A $msg;
    close A or die "couldn't send email $!";

and away it goes.

16.4.4 Search Engines and CGI
Most webmasters will be passionately anxious that their creations are properly indexed by the search engines on the Web, so that the teeming millions may share the delights they offer. At the time of writing, the search engines were coming under a good deal of criticism for being slow, inaccurate, arbitrary, and often plain wrong. One of the more serious criticisms alleged that sites that offered large numbers of separate pages produced by scripts from databases (in other words, most of the serious e-commerce sites) were not being properly indexed. According to one estimate, only 1 page in 500 would actually be found. This invisible material is often called "The Dark Web."

The Netcraft survey of June 2000 visited about 16 million web sites. At the same time Google claimed to be the most comprehensive search engine with 2 million sites indexed. This meant that, at best, only about one site in eight could then be found via the best search engine. Perhaps wisely, Google now does not claim a number of sites. Instead it claims (as of August, 2001) to index 1,387,529,000 web pages. Since the Netcraft survey for July 2001 showed 31 million sites (http://www.netcraft.com/Survey/Reports/200107/graphs.html), the implication is that the average site has only 44 pages, which seems too few by a long way and suggests that a lot of sites are not being indexed at all.

The reason seems to be that the search engines spend most of their time and energy fighting off "spam," that is, attempts to get pages better ratings than they deserve.
The spammers used CGI scripts long before databases became prevalent on the Web, so the search engines developed ways of detecting scripts. If their suspicions were triggered, suspect sites would not be indexed. No one outside the search-engine programming departments really knows the truth of the matter (and they aren't telling), but the mythology is that they don't like URLs that contain the characters "!" or "?", the words "cgi-bin," or the like. Several commercial development systems betray themselves like this, but if you write your own scripts and serve them up with Apache, you can produce pages that cannot be distinguished from static HTML. Working with script2_html and the corresponding Config file shown earlier, the trick is this:
As a result, when you click the link, the URL that gets executed, and which the search engines see, is http://www.butterthlies.com/script2_html/whole_database. The fatal words cgi-bin have disappeared, and there is nothing to show that the page returned is not static HTML. Well, apart from the perhaps equally fatal words script or database, which might give the game away . . . but you get the idea.

Another search-engine problem is that most of them cannot make their way through HTML frames. Since many web pages use them, this is a worry and makes one wonder whether the search engines are living in the same time frame as the rest of us. The answer is to provide a cruder home page, with links to all the pages you want indexed, in a <NOFRAMES> area. See your HTML reference book. A useful tool is a really old browser that also does not understand frames, so you can see your pages the way the search engines do. We use a Win 3.x copy of NCSA's Mosaic (download it from http://www.ncsa.uiuc.edu).

The <NOFRAMES> tag will tend to pick out the search engines, but it is not infallible. A more positive way to detect their presence is to watch to see whether the client tries to open the file robots.txt. This is a standard filename that contains instructions to spiders to keep them to the parts of the site you want. See the tutorial at http://www.searchengineworld.com/robots/robots_tutorial.htm. The RFC is at http://www.robotstxt.org/wc/norobots-rfc.html. If the visitor goes for robots.txt, you can safely assume that it is a spider and serve up a simple dish.

The search engines all have their own quirks. Google, for instance, ranks a site by the number of other pages that link to it, which is democratic but tends to hide the quirky bit of information that just interests you. The engines come and go with dazzling rapidity, so if you are in for the long haul, it is probably best to register your site with the big ones and forget about the whole problem.
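The robots.txt test mentioned above can be sketched in a few lines of Perl. This is only an illustration under stated assumptions: it supposes the access log is in the Common Log Format (client address first, request in double quotes), and the function names is_robots_request and spiders_in_log are made up for the example.

```perl
use strict;
use warnings;

# True if the requested path is robots.txt, ignoring any query string.
sub is_robots_request {
    my ($uri) = @_;
    return 0 unless defined $uri;
    $uri =~ s/\?.*//s;    # drop a query string, if any
    return $uri eq '/robots.txt' ? 1 : 0;
}

# Given access-log lines, return the sorted list of client addresses
# that asked for robots.txt at least once.
sub spiders_in_log {
    my (@lines) = @_;
    my %seen;
    foreach my $line (@lines) {
        # client ... "METHOD uri PROTOCOL" ...
        next unless $line =~ /^(\S+) .*?"[A-Z]+ (\S+)[^"]*"/;
        $seen{$1} = 1 if is_robots_request($2);
    }
    return sort keys %seen;
}
```

A script could call is_robots_request on its own request path to decide whether to serve the simpler page, while spiders_in_log is the after-the-fact version for log analysis.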
One of us (PL) has a medical encyclopedia (http://www.medic-planet.com). It logs the visits of search engines. After a heart-stopping initial delay of about three months when nothing happened, it now gets visits from several spiders every day and gets a steady flow of visitors that is remarkably constant from month to month. If you want to make serious efforts to seduce the search engines, look for further information at http://searchengineforms.com and http://searchenginewatch.com.

16.4.5 Debugging
Debugging CGI scripts can be tiresome because until they are pretty well working, nothing happens on the browser screen. If possible, it is a good idea to test a script every time you change it by running it locally from the command line before you invoke it from the Web. Perl will scan it, looking for syntax errors before it tries to run it. These error reports, which you will find repeated in the error log when you run under Apache, will save you a lot of grief. Similarly, try out your MySQL calls from the command line to make sure they work before you embed them in a script.

Keep an eye on the Apache error log: it will often give you a useful clue, though it can also be bafflingly silent even though things are clearly going wrong. A common cause of silent misbehavior is a bad call to MySQL. The DBI module never returns, so your script hangs without an explanation in the error log.

As long as you have printed an HTTP header, something (but not necessarily what you want) will usually appear in the browser screen. You can use this fact to debug your scripts, by printing variables or by putting print markers GOT TO 1<BR>, GOT TO 2<BR> . . . through the code so that you can find out where it goes wrong. (<BR> is the HTML command for a newline.) This doesn't always work because these debugging messages may appear in weird places on the screen or not at all, depending on how thoroughly you have confused the browser.
You can also print to error_log from your script:

    print STDERR "thing\n";

or use:

    warn "thing\n";

If you have an HTML document that sets up frames and you print anything else on the same page, it will not appear. This can be really puzzling. You can see the HTML that was actually sent to the browser by putting the cursor on the page, right-clicking the mouse, and selecting View Source (or similar, depending on your flavor of browser).

When working with a database, it is often useful to print out the $query variable before the database is accessed. It is worth remembering that although scripts that invoke MySQL will often run from the command line (with various convincing error messages caused by variables not being properly set up), if queries go wrong when the script is run by Apache, they tend to hang without necessarily writing anything to error_log. Often the problem is caused by getting the quote marks wrong or by invoking incorrect field names in the query.

A common, but enigmatic, message in error_log is: Premature end of script headers. This signals that the HTTP header went wrong and can be caused by several different mistakes:
Occasionally, these simple tricks do not work, and you need to print variables to a file to follow what is going on. If you print your error messages to STDERR, they will appear in the error log. Alternatively, if you want errors printed to your own file, remember that any program executed by Apache belongs to the useless webuser, and it can only write files without permission problems in webuser's home directory. You can often elicit useful error messages by using:

    open B,">>/home/webserver/script_errors" or die "couldn't open: $!";
    close B;

Sometimes you have to deal with a bit of script that prints no page. For instance, when WorldPay (described in Chapter 12) has finished with a credit card transaction, it can call a link to your web site again. You probably will want the script to write the details of the transaction to the database, but there is no browser to print debugging messages. The only way out is to print them to a file, as earlier.

If you are programming your script in Perl, the CGI::Carp module can be helpful. However, most other languages[4] that you might want to use for CGI do not have anything so useful.

16.4.6 Debuggers
If you are programming in a high-level language and want to run a debugger, it is usually impossible to do so directly. However, it is possible to simulate the environment in which an Apache script runs. The first thing to do is to become the user that Apache runs as. Then, remember that Apache always runs a script in the script's own directory, so go to that directory. Next, Apache passes most of the information a script needs in environment variables. Determine what those environment variables should be (either by thinking about it or, more reliably, by temporarily replacing your CGI with one that executes env, as illustrated earlier), and write a little script that sets them and then runs your CGI (possibly under a debugger).
Since Apache sets a vast number of environment variables, it is worth knowing that most CGI scripts use relatively few of them: usually only QUERY_STRING (or PATH_INFO, less often). Of course, if you wrote the script and all its libraries, you'll know what it used, but that isn't always the case. So, to give a concrete example, suppose we wanted to debug some script written in C. We'd go into .../cgi-bin and write a script called, say, debug.cgi, that looked something like this:

    #!/bin/sh
    QUERY_STRING='2315_order=20&2316_order=10&card_type=Amex'
    export QUERY_STRING
    gdb mycgi

We'd run it by typing:

    chmod +x debug.cgi
    ./debug.cgi

Once gdb came up, we'd hit r<CR>, and the script would run.[5]

A couple of things may trip you up here. The first is that if the script expects the POST method (that is, if REQUEST_METHOD is set to POST), the script will, if it is working correctly, expect the QUERY_STRING to be supplied on its standard input rather than in the environment. Most scripts use a library to process the query string, so the simple solution is to not set REQUEST_METHOD for debugging, or to set it to GET instead. If you really must use POST, then the script would become:

    #!/bin/sh
    REQUEST_METHOD=POST
    export REQUEST_METHOD
    mycgi << EOF
    2315_order=20&2316_order=10&card_type=Amex
    EOF

Note that this time we didn't run the debugger, for the simple reason that the debugger also wants input from standard input. To accommodate that, put the query string in some file, and tell the debugger to use that file for standard input (in gdb's case, that means type r < yourfile).

The second tricky thing occurs if you are using Perl and the standard Perl module CGI.pm. In this case, CGI helpfully detects that you aren't running under Apache and prompts for the query string. It also wants the individual items separated by newlines instead of ampersands. The simple solution is to do something very similar to the solution to the POST problem we just discussed, except with newlines.
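Converting a query string into that one-pair-per-line form is a one-liner, sketched here as a small helper. The name query_to_lines is ours; the sample string is the one used in debug.cgi above.

```perl
use strict;
use warnings;

# Turn "a=1&b=2" into "a=1\nb=2\n": one name=value pair per line.
# Empty items (from a stray "&&") are dropped.
sub query_to_lines {
    my ($query) = @_;
    return join '', map { "$_\n" } grep { length } split /&/, $query;
}

print query_to_lines('2315_order=20&2316_order=10&card_type=Amex');
```

The output can be saved to a file and fed to the script's standard input, in the same way as the here-document in the POST example.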
16.4.7 Security
Security should be the sensible webmasters' first and last concern. This list of questions, all of which you should ask yourself, is from Sysadmin: The Journal for Unix System Administrators, at http://www.samag.com/current/feature.shtml. See also Chapter 11 and Chapter 12.
Perl can help. Put this at the top of your scripts:

    #! /usr/local/bin/perl -w -T
    use strict;
    ....

The -w flag to Perl prints various warning messages at runtime. -T switches on taint checking, which prevents the malicious program the Bad Guys send you disguised as data from doing anything bad. The line use strict checks that your variables are properly declared. On security questions in general, you might like to look at Lincoln Stein's well-regarded "Secure CGI FAQ" at http://www-genome.wi.mit.edu/WWW/faqs/www-security-faq.html.

16.5 Script Directives
Apache has five directives dealing with CGI scripts.
The ScriptAlias directive does two things. It sets Apache up to execute CGI scripts, and it converts requests for URLs starting with URLpath to execution of the script in CGIpath. For example:

    ScriptAlias /bin /usr/local/apache/cgi-bin

An incoming URL like www.butterthlies.com/bin/fred will run the script /usr/local/apache/cgi-bin/fred. Note that CGIpath must be an absolute path, starting at /.

A very useful feature of ScriptAlias is that the incoming URL can be loaded with fake subdirectories. Thus, the incoming URL www.butterthlies.com/bin/fred/purchase/learjet will run .../fred as before, but will also make the text /purchase/learjet available to fred in the environment variable PATH_INFO. In this way you can write a single script to handle a multitude of different requests. You just need to examine PATH_INFO at the top of the script and dispatch the requests to different subroutines.
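As a sketch of that kind of dispatching, the following takes the first component of PATH_INFO as the name of an action and hands the remainder to the matching subroutine. The actions catalog and purchase, and the subroutine names, are hypothetical; only PATH_INFO itself comes from Apache.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Hypothetical handlers; each returns the text of a page.
sub show_catalog { return "catalog\n" }

sub purchase {
    my ($item) = @_;
    $item = 'nothing' unless defined $item;
    return "purchase of $item\n";
}

# Dispatch on the first component of PATH_INFO, passing the rest along.
sub dispatch {
    my ($path_info) = @_;
    $path_info = '' unless defined $path_info;
    # "/purchase/learjet" splits into "", "purchase", "learjet"
    my (undef, $action, $rest) = split m{/}, $path_info, 3;
    my %handler = (
        catalog  => \&show_catalog,
        purchase => \&purchase,
    );
    return "unknown request\n"
        unless defined $action && exists $handler{$action};
    return $handler{$action}->($rest);
}

print "Content-Type: text/plain\n\n";
print dispatch($ENV{'PATH_INFO'});
```

Looking the action up in a hash, rather than calling a subroutine named directly by the URL, means that only the actions you have listed can ever be run.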
ScriptAliasMatch is equivalent to ScriptAlias but makes use of standard regular expressions instead of simple prefix matching. The supplied regular expression is matched against the URL; if it matches, the server will substitute any parenthesized matches into the given string and use the result as a filename. For example, to activate any script in /cgi-bin, one might use the following:

    ScriptAliasMatch /cgi-bin/(.*) /usr/local/apache/cgi-bin/$1

If the user is sent by a link to http://www.butterthlies.com/cgi-bin/script3, "/cgi-bin/" matches against /cgi-bin/. We then have to match script3 against .*, which works, because "." means any character and "*" means any number of whatever matches ".". The parentheses around .* tell Apache to store whatever matched to .* in the variable $1. (If some other pattern followed, also surrounded by parentheses, that would be stored in $2.) In the second part of the line, ScriptAliasMatch is told, in effect, to run /usr/local/apache/cgi-bin/script3.
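The same match-and-substitute step can be tried out in Perl, whose regular expressions treat ".", "*", and parentheses in the way just described. This is only an imitation for experimenting with patterns, not Apache's own code, and the function name map_script is invented for the example.

```perl
use strict;
use warnings;

# Imitate the ScriptAliasMatch line above: if the URL path matches,
# substitute the parenthesized part into the filename; otherwise
# return nothing.
sub map_script {
    my ($url_path) = @_;
    if ($url_path =~ m{/cgi-bin/(.*)}) {
        return "/usr/local/apache/cgi-bin/$1";
    }
    return;
}

my $file = map_script('/cgi-bin/script3');
print defined $file ? "$file\n" : "no match\n";
```

Trying a new pattern this way before putting it in the Config file shows what each set of parentheses captures.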
Since debugging CGI scripts can be rather opaque, the ScriptLog directive allows you to choose a log file that shows what is happening with CGIs. However, once the scripts are working, disable logging, since it slows Apache down and offers the Bad Guys some tempting crannies.
ScriptLogLength specifies the maximum length of the debug log. Once this value is exceeded, logging stops (after the last complete message).
ScriptLogBuffer specifies the maximum size in bytes for recording a POST request.
Scripts can go wild and monopolize system resources: this unhappy outcome can be controlled by three directives.
RLimitCPU takes one or two parameters. Each parameter may be a number or the word max, which invokes the system maximum, in seconds per process. The first parameter sets the soft resource limit; the second the hard limit.[6]
RLimitMEM takes one or two parameters. Each parameter may be a number or the word max, which invokes the system maximum, in bytes of memory used per process. The first parameter sets the soft resource limit; the second the hard limit.
RLimitNPROC takes one or two parameters. Each parameter may be a number or the word max, which invokes the system maximum, in processes per user. The first parameter sets the soft resource limit; the second the hard limit.

16.6 suEXEC on Unix
The vulnerability of servers running scripts is a continual source of concern to the Apache Group. Unix systems provide a special method of running CGIs that gives much better security via a wrapper. A wrapper is a program that wraps around another program to change the way it operates. Usually this is done by changing its environment in some way; in this case, it makes sure it runs as if it had been invoked by an appropriate user. The basic security problem is that any program or script run by Apache has the same permissions as Apache itself. Of course, these permissions are not those of the superuser, but even so, Apache tends to have permissions powerful enough to impair the moral development of a clever hacker if he could get his hands on them. Also, in environments where there are many users who can write scripts independently of each other, it is a good idea to insulate them from each other's bugs, as much as is possible.

suEXEC reduces this risk by changing the permissions given to a program or script launched by Apache. To use it, you should understand the Unix concepts of user and group execute permissions on files and directories. suEXEC is executed whenever an HTTP request is made for a script or program that has ownership or group-membership permissions different from those of Apache itself, which will normally be those appropriate to webuser of webgroup.

The documentation says that suEXEC is quite deliberately complicated so that "it will only be installed by users determined to use it." However, we found it no more difficult than Apache itself to install, so you should not be deterred from using what may prove to be a very valuable defense.
If you are interested, please consult the documentation and be guided by it. What we have written in this section is intended only to help and encourage, not to replace the words of wisdom. See http://httpd.apache.org/docs/suexec.html.

To install suEXEC to run with the demonstration site site.suexec, go to the support subdirectory below the location of your Apache source code. Edit suexec.h to make the following changes to suit your installation. What we did, to suit our environment, is shown marked by /**CHANGED**/:

    /*
     * HTTPD_USER -- Define as the username under which Apache normally
     *               runs. This is the only user allowed to execute
     *               this program.
     */
    #ifndef HTTPD_USER
    #define HTTPD_USER "webuser" /**CHANGED**/
    #endif

    /*
     * UID_MIN -- Define this as the lowest UID allowed to be a target user
     *            for suEXEC. For most systems, 500 or 100 is common.
     */
    #ifndef UID_MIN
    #define UID_MIN 100
    #endif

The point here is that many systems have "privileged" users below some number (e.g., root, daemon, lp, and so on), so we can use this setting to avoid any possibility of running a script as one of these users.

    /*
     * GID_MIN -- Define this as the lowest GID allowed to be a target group
     *            for suEXEC. For most systems, 100 is common.
     */
    #ifndef GID_MIN
    #define GID_MIN 100 // see UID above
    #endif

Similarly, there may be privileged groups.

    /*
     * USERDIR_SUFFIX -- Define to be the subdirectory under users'
     *                   home directories where suEXEC access should
     *                   be allowed. All executables under this directory
     *                   will be executable by suEXEC as the user so
     *                   they should be "safe" programs. If you are
     *                   using a "simple" UserDir directive (ie. one
     *                   without a "*" in it) this should be set to
     *                   the same value. suEXEC will not work properly
     *                   in cases where the UserDir directive points to
     *                   a location that is not the same as the user's
     *                   home directory as referenced in the passwd file.
     *
     *                   If you have VirtualHosts with a different
     *                   UserDir for each, you will need to define them to
     *                   all reside in one parent directory; then name that
     *                   parent directory here. IF THIS IS NOT DEFINED
     *                   PROPERLY, ~USERDIR CGI REQUESTS WILL NOT WORK!
     *                   See the suEXEC documentation for more detailed
     *                   information.
     */
    #ifndef USERDIR_SUFFIX
    #define USERDIR_SUFFIX "/usr/www/APACHE3/cgi-bin" /**CHANGED**/
    #endif

    /*
     * LOG_EXEC -- Define this as a filename if you want all suEXEC
     *             transactions and errors logged for auditing and
     *             debugging purposes.
     */
    #ifndef LOG_EXEC
    #define LOG_EXEC "/usr/www/APACHE3/suexec.log" /**CHANGED**/
    #endif

    /*
     * DOC_ROOT -- Define as the DocumentRoot set for Apache. This
     *             will be the only hierarchy (aside from UserDirs)
     *             that can be used for suEXEC behavior.
     */
    #ifndef DOC_ROOT
    #define DOC_ROOT "/usr/www/APACHE3/site.suexec/htdocs" /**CHANGED**/
    #endif

    /*
     * SAFE_PATH -- Define a safe PATH environment to pass to CGI executables.
     *
     */
    #ifndef SAFE_PATH
    #define SAFE_PATH "/usr/local/bin:/usr/bin:/bin"
    #endif

Compile the file to make suEXEC executable by typing:

    make suexec
and copy it to a sensible location alongside Apache itself (this will very likely be different on your site, so replace /usr/local/bin with whatever is appropriate) with the following:

    cp suexec /usr/local/bin
You then have to set its permissions properly by making yourself the superuser (or persuading the actual, human superuser to do it for you if you are not allowed to) and typing:

    chown root /usr/local/bin/suexec
    chmod 4711 /usr/local/bin/suexec

The first line gives suEXEC the owner root; the second sets the setuserid execution bit for file modes.

You then have to tell Apache where to find the suEXEC executable by editing .../src/include/httpd.h. We looked for "suEXEC" and changed it thus:

    /* The path to the suExec wrapper; can be overridden in Configuration */
    #ifndef SUEXEC_BIN
    #define SUEXEC_BIN "/usr/local/bin/suexec" /**CHANGED**/
    #endif

This line was originally:

    #define SUEXEC_BIN HTTPD_ROOT "/sbin/suexec"

Notice that the macro HTTPD_ROOT has been removed. It is easy to leave it in by mistake (we did the first time around), but it prefixes /usr/local/apache (or whatever you may have changed it to) to the path you type in, which may not be what you want to happen. Having done this, you remake Apache by getting into the .../src directory and typing:

    make
    cp httpd /usr/local/bin

or wherever you want to keep the executable. When you start Apache, nothing appears to be different, but a message appears in .../logs/error_log:[7]

    suEXEC mechanism enabled (wrapper: /usr/local/bin/suexec)

We think that something as important as suEXEC should have a clearly visible indication on the command line and that an entry in a log file is not immediate enough. To turn suEXEC off, you simply remove the executable or, more cautiously, rename it to, say, suexec.not. Apache then can't find it and carries on without comment.

Once suEXEC is running, it applies many tests to any CGI or server-side include (SSI) script invoked by Apache. If any of the tests fail, a note will appear in the suexec.log file that you specified (as the macro LOG_EXEC in suexec.h) when you compiled suEXEC. A comprehensive list appears in the documentation and also in the source.
Many of these tests can only fail if there is a bug in Apache, suEXEC, or the operating system, or if someone is attempting to misuse suEXEC. We list here the notes that you are likely to encounter in normal operation, since you should never come across the others. If you do, suspect the worst:
If all these hurdles are passed, then the program executes. In setting up your system, you have to bear these hurdles in mind. Note that once suEXEC has decided it will execute your script, it then makes it even safer by cleaning the environment: that is, deleting any environment variables not on its list of safe ones and replacing the PATH with the path defined in SAFE_PATH in suexec.h. The list of safe environment variables can be found in .../src/support/suexec.c in the variable safe_env_lst. This list includes all the standard variables passed to CGI scripts. Of course, this means that any special-purpose variables you set with SetEnv or PassEnv directives will not make it to your CGI scripts unless you add them to suexec.c.

16.6.1 A Demonstration of suEXEC

So far, for the sake of simplicity, we have been running everything as root, to which all things are possible. To demonstrate suEXEC, we need to create a humble but ill-intentioned user, Peter, who will write and run a script called badcgi.cgi intending to do harm to those around. badcgi.cgi simply deletes /usr/victim/victim1 as a demonstration of its power, but it could do many worse things. The file victim1 belongs to webuser and webgroup. Normally, Peter, who is not webuser and does not belong to webgroup, would not be allowed to do anything to it, but if he gets at it through Apache (undefended by suEXEC), he can do what he likes.

Peter creates himself a little web site in his home directory, /home/peter, which contains the directories:

    conf
    logs
    public_html

and the usual file go:

    httpd -d /home/peter

The Config file is:

    User webuser
    Group webgroup
    ServerName www.butterthlies.com
    ServerAdmin sales@butterthlies.com
    UserDir public_html
    AddHandler cgi-script cgi

Most of this is relevant in the present situation. By specifying webuser and webgroup, we give any program executed by Apache that user and group.
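To see for yourself which user and group a CGI script receives, and which environment variables reach it, a small diagnostic script can help. The following is a hedged sketch rather than part of the demonstration site: the name whoami.cgi and the function report_identity are hypothetical. Under the Config file above, and without suEXEC, we would expect id to report webuser and webgroup; with suEXEC in place, the identity and the surviving variables may differ.

```shell
#!/bin/sh
# whoami.cgi: hypothetical diagnostic CGI script, not part of the
# demonstration site. It reports the identity and environment that
# the script actually receives from the server.

report_identity() {
    # A CGI response needs a header line followed by a blank line.
    echo "Content-Type: text/plain"
    echo
    # id prints the effective user and group of this process.
    echo "Running as: $(id)"
    echo "PATH: $PATH"
    echo "Environment:"
    env | sort
}

report_identity
```

Because the script prints the whole environment, it should be removed from public_html once you have finished experimenting.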
In our guise of Peter, we are going to ask the browser to log onto http://www.butterthlies.com/~peter, that is, to the home directory of Peter on the computer whose port answers to www.butterthlies.com. Once in that home directory, we are referred to the UserDir public_html, which acts pretty much the same as DocumentRoot in the web sites with which we have been playing.

Peter puts an innocent-looking Butterthlies form, form_summer.html, into public_html. But it conceals a viper! Instead of having ACTION="mycgi.cgi", as innocent forms do, this one calls badcgi.cgi, which looks like this:

    #!/bin/sh
    echo "Content-Type: text/plain"
    echo
    rm -f /usr/victim/victim1

This is a script of unprecedented villainy, whose last line will utterly destroy and undo the innocent file victim1. Remembering that any CGI script executed by Apache has only the user and group permissions specified in the Config file (that is, webuser and webgroup), we go and make the target file the same, by logging on as root and typing:

    chown webuser:webgroup /usr/victim
    chown webuser:webgroup /usr/victim/victim1

Now, if we log on as Peter and execute badcgi.cgi, we are roundly rebuffed:

    ./badcgi.cgi
    rm: /usr/victim/victim1: Permission denied

This is as it should be: Unix security measures are working. However, if we do the same thing under the cloak of Apache, by logging on as root and executing:

    /home/peter/go
and then, on the browser, accessing http://www.butterthlies.com/~peter, opening form_summer.html, and clicking the Submit button at the bottom of the form, we see that the browser is accessing www.butterthlies.com/~peter/badcgi.cgi, and we get the warning message:

    Document contains no data

This statement is regrettably true because badcgi.cgi now has the permissions of webuser and webgroup; it can execute in the directory /usr/victim, and it has removed the unfortunate victim1 in insolent silence.

So much for what an in-house Bad Guy could do before suEXEC came along. If we now replace victim1, stop Apache, rename suexec.not to suexec, restart Apache (checking that the .../logs/error_log file shows that suEXEC started up), and click Submit on the browser again, we get the following comforting message:

    Internal Server Error
    The server encountered an internal error or misconfiguration and was unable to complete your request.
    Please contact the server administrator, sales@butterthlies.com and inform them of the time the error occurred, and anything you might have done that may have caused the error.

The error log contains the following:

    [Tue Sep 15 13:42:53 1998] [error] malformed header from script. Bad header=suexec running: /home/peter/public_html/badcgi.cgi

Ha, ha!

16.7 Handlers

A handler is a piece of code built into Apache that performs certain actions when a file with a particular MIME or handler type is called. For example, a file with the handler type cgi-script needs to be executed as a CGI script. This is illustrated in ... /site.filter. Apache has a number of handlers built in, and others can be added with the Action directive (see the next section). The built-in handlers are as follows:
The corresponding directives follow.
AddHandler wakes up an existing handler and maps the filename(s) extension1, etc., to handler-name. You might specify the following in your Config file:

    AddHandler cgi-script cgi bzq

From then on, any file with the extension .cgi or .bzq would be treated as an executable CGI script.
This does the same thing as AddHandler, but applies the transformation specified by handler-name to all files in the <Directory>, <Location>, or <Files> section in which it is placed, or in the directory of the .htaccess file that contains it. For instance, in Chapter 10, we write:

    <Location /status>
    <Limit get>
    order deny,allow
    allow from 192.168.123.1
    deny from all
    </Limit>
    SetHandler server-status
    </Location>
The RemoveHandler directive removes any handler associations for files with the given extensions. This allows .htaccess files in subdirectories to undo any associations inherited from parent directories or the server config files. An example of its use might be:

    /foo/.htaccess:
        AddHandler server-parsed .html
    /foo/bar/.htaccess:
        RemoveHandler .html

This has the effect of treating .html files in the /foo/bar directory as normal files, rather than as candidates for parsing (see the mod_include module). The extension argument is case insensitive and can be specified with or without a leading dot.

16.8 Actions

A notion related to handlers is that of actions (nothing to do with the HTML form "Action" discussed earlier). An action passes specified files through a named CGI script before they are served up. Apache v2 has the somewhat related "Filter" mechanism.

16.8.1 Action

    Action type cgi_script
    Server config, virtual host, directory, .htaccess

The cgi_script is applied to any file of MIME or handler type matching type whenever it is requested. This mechanism can be used in a number of ways. For instance, it can be handy to put certain files through a filter before they are served up on the Web. As a simple example, suppose we wanted to keep all our .html files in compressed format to save space and to decompress them on the fly as they are retrieved. Apache happily does this. We make site.filter a copy of site.first, except that the httpd.conf file is as follows:

    User webuser
    Group webgroup
    ServerName localhost
    DocumentRoot /usr/www/APACHE3/site.filter/htdocs
    ScriptAlias /cgi-bin /usr/www/APACHE3/cgi-bin
    AccessConfig /dev/null
    ResourceConfig /dev/null
    AddHandler peter-zipped-html zhtml
    Action peter-zipped-html /cgi-bin/unziphtml
    <Directory /usr/www/APACHE3/site.filter/htdocs>
    DirectoryIndex index.zhtml
    </Directory>

The points to notice are that:
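The CGI script ... /cgi-bin/unziphtml contains the following:

    #!/bin/sh
    echo "Content-Type: text/html"
    echo
    gzip -S .zhtml -d -c "$PATH_TRANSLATED"

(The quotation marks around $PATH_TRANSLATED keep the script working if the path contains spaces.) This applies gzip with the following flags: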
gzip is applied to the file contained in the environment variable PATH_TRANSLATED. Finally, we have to turn our .htmls into .zhtmls. In ... /htdocs we have compressed and renamed:
It would be simpler to leave them as gzip does (with the extension .html.gz), but a file extension that maps to a MIME type (described in Chapter 16) cannot have a "." in it.[8]

We also have index.html, which we want to convert, but we have to remember that it must call up the renamed catalogs with .zhtml extensions. Once that has been attended to, we can gzip it and rename it to index.zhtml.

We learned that Apache automatically serves up index.html if it is found in a directory. But this won't happen now, because we have index.zhtml. To get it to be produced as the index, we need the DirectoryIndex directive (see Chapter 7), and it has to be applied to a specified directory:

    <Directory /usr/www/APACHE3/site.filter/htdocs>
    DirectoryIndex index.zhtml
    </Directory>

Once all that is done and ./go is run, the page looks just as it did before.

16.9 Browsers

One complication of the Web is that people are free to choose their own browsers, and not all browsers work alike or even nearly alike. They vary enormously in their capabilities. Some browsers display images; others won't. Some that display images won't display frames, tables, Java, and so on.

You can try to circumvent this problem by asking the customer to go to different parts of your script ("Click here to see the frames version"), but in real life people often do not know what their browser will and won't do. A lot of them will not even understand what question you are asking. To get around this problem, Apache can detect the browser type and set environment variables so that your CGI scripts can detect the type and act accordingly.
The attribute can be one of the HTTP request header fields, such as Host, User-Agent, Referer, and/or one of the following:
The NoCase version works the same except that regular-expression matching is evaluated without regard to letter case.
regex is a regular expression matched against the client's User-Agent header, and env1, env2, ... are environment variables to be set if the regular expression matches. The environment variables are set to value1, value2, etc., if present. So, for instance, we might say:

    BrowserMatch ^Mozilla/[23] tables=3 java

The symbol ^ means start from the beginning of the header and match the string Mozilla/ followed by either a 2 or a 3. If this is successful, then Apache creates and, if required, specifies values for the given list of environment variables. These variables are invented by the author of the script, and in this case they are:

    tables=3
    java

In the CGI script, these variables can be tested and the appropriate action taken.

BrowserMatchNoCase is simply a case-blind version of BrowserMatch. That is, it doesn't care whether letters are upper- or lowercase: mOZILLA works as well as MoZiLlA.

Note that there is no difference between BrowserMatch and SetEnvIf User-Agent. BrowserMatch exists for backward compatibility.
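As a sketch of how a CGI script might test these variables, the following shell script chooses its markup according to whether tables and java have been set by the BrowserMatch line shown earlier. It is an illustration only: the function name emit_catalog and the page content are invented here, and the script assumes that the server passes the variables into the script's environment under the same names.

```shell
#!/bin/sh
# Hypothetical CGI sketch: choose markup according to the variables
# set by the BrowserMatch example (tables=3 java). The function name
# and the page content are invented for illustration.

emit_catalog() {
    echo "Content-Type: text/html"
    echo
    echo "<html><body>"
    if [ -n "${tables:-}" ]; then
        # The browser matched the pattern, so we assume tables are safe.
        echo "<table><tr><td>Summer catalog</td></tr></table>"
    else
        # Unknown browser: fall back to plain paragraphs.
        echo "<p>Summer catalog</p>"
    fi
    if [ -n "${java:-}" ]; then
        echo "<p>Java-capable browser detected.</p>"
    fi
    echo "</body></html>"
}

emit_catalog
```

The script tests only whether each variable is non-empty, so it does not depend on the particular value that Apache assigns to a variable such as java that is listed without one.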
This disables KeepAlive (see Chapter 3). Some versions of Netscape claimed to support KeepAlive, but they actually had a bug that meant the server appeared to hang (in fact, Netscape was attempting to reuse the existing connection, even though the server had closed it). The directive: BrowserMatch "Mozilla/2" nokeepalive disables KeepAlive for those buggy versions.[9]
This forces Apache to respond with HTTP 1.0 to an HTTP 1.0 client, instead of with HTTP 1.1, as is called for by the HTTP 1.1 spec. This is required to work around certain buggy clients that don't recognize HTTP 1.1 responses. Various clients have this problem. The current recommended settings are as follows:[10]

    #
    # The following directives modify normal HTTP response behavior.
    # The first directive disables keepalive for Netscape 2.x and browsers that
    # spoof it. There are known problems with these browser implementations.
    # The second directive is for Microsoft Internet Explorer 4.0b2
    # which has a broken HTTP/1.1 implementation and does not properly
    # support keepalive when it is used on 301 or 302 (redirect) responses.
    #
    BrowserMatch "Mozilla/2" nokeepalive
    BrowserMatch "MSIE 4\.0b2;" nokeepalive downgrade-1.0 force-response-1.0
    #
    # The following directive disables HTTP/1.1 responses to browsers which
    # are in violation of the HTTP/1.0 spec by not being able to grok a
    # basic 1.1 response.
    #
    BrowserMatch "RealPlayer 4\.0" force-response-1.0
    BrowserMatch "Java/1\.0" force-response-1.0
    BrowserMatch "JDK/1\.0" force-response-1.0
This forces Apache to downgrade to HTTP 1.0 even though the client is HTTP 1.1 (or higher). Microsoft Internet Explorer 4.0b2 earned the dubious distinction of being the only known client to require all three of these settings: BrowserMatch "MSIE 4\.0b2;" nokeepalive downgrade-1.0 force-response-1.0